Introduction

This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.

All ds4psy essentials:

Nr. Topic
1. Creating and using tibbles
2. Data transformation
3. Visualizing data

Course coordinates

spds.uni.kn

Preparations

Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)

## Essential commands | Data science for psychologists
## 2018 07 05
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)

## Visualize data and EDA: ggplot and dplyr ----- 

# ...

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 

Visualizing data

In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.

See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and the links provided below for more detailed information.

Commands and examples

General structure of ggplot calls

A generic template for creating a graph with ggplot is:

# Generic ggplot template: 
ggplot(data = <DATA>) + 
  <GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
  <FACET_fun> +    # optional
  <LOOK_GOOD_fun>  # optional 
  
# Minimal ggplot template:
ggplot(<DATA>) + 
  <GEOM_fun>(aes(<MAPPING>))

The generic template includes the following parts:

  • <DATA> is a data frame or tibble that contains the data that is to be plotted.

  • <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)

  • A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
    1. in the aesthetic mapping (when visual features vary according to data properties), or
    2. by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when they remain constant).

  • An optional <FACET_fun> splits a complex plot into multiple subplots.

  • A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
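The difference between mapping an aesthetic to data and setting it to a constant value can be sketched as follows. This is a minimal illustration using the mpg data of ggplot2; the object names p_map and p_set are only chosen for this example:

```r
library(ggplot2)

# Mapping: color varies with a variable, so it goes INSIDE aes():
p_map <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

# Setting: color is constant, so it goes OUTSIDE aes():
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "steelblue")

p_map  # points colored by class (with a legend)
p_set  # all points in the same color (no legend)
```

Placing a constant like color = "steelblue" inside aes() would treat it as data (a variable with a single value), which is usually not what is intended.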

Some examples that illustrate the use of these components are:

A histogram

A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:

library(ggplot2)



# Data: ------ 
# Using mpg data:
?ggplot2::mpg
mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (A) Histogram: ------

# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) +  # set mappings for ALL geoms
  geom_histogram(binwidth = 1) 
hi1


# The same histogram:
hi1b <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1)  # set mappings for THIS geom
hi1b


# (B) Adding aesthetics, labels and themes: ------ 

# Enhanced version of the same plot: 
hi2 <- ggplot(mpg) +
  geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
  labs(title = "Distribution of fuel economy in city environments", 
       x = "cty (miles per gallon)",
       caption = "Data from ggplot2::mpg") +
  theme_light()
hi2

A scatterplot

A scatterplot shows each observation as a point whose position is determined by 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:


# (A) Scatterplot: ------ 

# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline()
sp1

Dealing with overplotting

A common issue with scatterplots is so-called overplotting: Multiple observations share the same position, so that their points are drawn on top of each other.

Here are some ways of dealing with this issue:

  1. jitter adds randomness to positions;
  2. alpha uses transparency to show frequency of positions;
  3. geom_count (or the size aesthetic) allows mapping values (e.g., frequency) to object size;
  4. facet_wrap allows disentangling plots by levels of variables.

Some examples include:

## Dealing with overplotting: ----- 

# 1. One way of dealing with overplotting is 
#    adding randomness to point positions:  
sp2 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "jitter") +
  geom_abline()
sp2


# 2. Another way of dealing with overplotting is 
#    using transparency (via setting alpha to < 1): 
sp3 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), position = "identity", 
             pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
  geom_abline(linetype = 2, color = "firebrick") # + 
  # geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3


# Adding labels and themes to plots: 
sp4 <- sp3 +   # use the plot defined above
  labs(title = "Fuel economy on highway vs. city",
                x = "City (miles per gallon)",
                y = "Highway (miles per gallon)",
                caption = "Data from ggplot2::mpg") +
  # coord_fixed() +
  theme_bw()
sp4


# (C) Grouping (by a categorical variable): ------  

# Using facets to avoid overplotting: 
sp5 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  geom_abline() + 
  facet_wrap(~class) +
  theme_bw()
sp5


# Grouping by color:
sp6 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = class), 
             position = "jitter", alpha = 1/2, size = 4) +
  geom_abline(linetype = 2) +
  theme_bw()
sp6


# Grouping by facets: 
sp7 <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), 
             position = "jitter", alpha = 1/2, size = 2) +
  geom_abline(linetype = 2) +
  facet_wrap(~class) +
  theme_bw()
sp7
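Mapping the frequency of positions to object size (the third option in the list above) is not covered by the examples so far. A minimal sketch, assuming that geom_count() is an acceptable way of doing this (the name sp_count is only used for this illustration):

```r
library(ggplot2)

# geom_count() counts the observations at each position 
# and maps this count to the size of the point: 
sp_count <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), alpha = 1/2) +
  geom_abline(linetype = 2)
sp_count
```

In contrast to jittering, this keeps all points at their exact positions, but requires reading a size legend.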

See https://ggplot2.tidyverse.org/reference/ for more examples.

Note some details:

  • ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.

  • The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.

  • Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).

  • When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).

  • In ggplot, a sequence of commands is combined by +, rather than %>%.

  • The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).
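Two of these details (shared mappings and the order of layers) can be illustrated by a short sketch. The combination of geom_point() and geom_rug() and the name sp_rug are merely chosen for this example:

```r
library(ggplot2)

# The common mapping is stated once in ggplot() and inherited by both geoms.
# Layers are combined by + and drawn in the order in which they are added:
sp_rug <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1/4) +   # 1st layer (drawn first)
  geom_rug(alpha = 1/4)       # 2nd layer (drawn on top)
sp_rug
```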

EDA

Creating good graphs is both an art and a craft. The key to creating good graphs lies in answering 2 sets of questions:

  1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like

    • How many variables are there to plot?
    • Are these variables categorical or continuous?
    • Do some variables qualify (e.g., group) the values of others?
  2. Knowing the intended type of plot. This includes answering functional questions like

    • What is the purpose of this plot?
    • What are possible plots for this purpose?
    • Which of these would be the most appropriate plot?

Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.
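The data-related questions can often be answered by inspecting the data before plotting. A minimal sketch for the mpg data, using only base R functions (the names var_types and n_values are chosen for this illustration):

```r
library(ggplot2)  # only needed for the mpg data

# How many variables are there, and of which type? 
var_types <- vapply(mpg, function(v) class(v)[1], character(1))
var_types
table(var_types)

# How many different values does each variable contain? 
# (Variables with few different values may be used for grouping.)
n_values <- vapply(mpg, function(v) length(unique(v)), integer(1))
n_values
```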

Basic plot types

Histograms

A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:

library(ggplot2)
library(tibble)  # provides tibble() and as_tibble()
 
# Create data: 
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
 
# Basic histogram:
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5)


# Pimped histogram: 
ggplot(tb) + 
  geom_histogram(aes(x = iq), binwidth = 5, 
                 fill = "gold", color = "black") +
  labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
       caption = "[Using random iq data.]") +
  theme_classic()

Scatterplots

A scatterplot shows the relationship between 2 (typically continuous) variables:

# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1         5.10        3.50         1.40       0.200 setosa 
#>  2         4.90        3.00         1.40       0.200 setosa 
#>  3         4.70        3.20         1.30       0.200 setosa 
#>  4         4.60        3.10         1.50       0.200 setosa 
#>  5         5.00        3.60         1.40       0.200 setosa 
#>  6         5.40        3.90         1.70       0.400 setosa 
#>  7         4.60        3.40         1.40       0.300 setosa 
#>  8         5.00        3.40         1.50       0.200 setosa 
#>  9         4.40        2.90         1.40       0.200 setosa 
#> 10         4.90        3.10         1.50       0.100 setosa 
#> # ... with 140 more rows

# Basic scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))


# Using 3 different facets:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  facet_wrap(~Species)


# Pimped scatterplot:
ggplot(ir) +
  geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), pch = 21, color = "black", size = 2, alpha = 1/2) +
  facet_wrap(~Species) +
  # coord_fixed() + 
  labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
       caption = "[Using iris data.]") + 
  theme_bw() +
  theme(legend.position = "none")

Bar plots

Another common type of plot shows the values across different levels of some variable as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:

Counts of cases

By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (a) Count number of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class))


# (b) is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..))


# (c) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class), stat = "count")


# (d) is the same as:
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count..), stat = "count")


# (e) pimped version:
ggplot(mpg) + 
  geom_bar(aes(x = class, fill = class), 
           # stat = "count", 
           color = "black") + 
  labs(title = "Counts of cars by class",
       x = "Class of car", y = "Frequency") + 
  scale_fill_brewer(name = "Class:", palette = "Blues") + 
  theme_bw()
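In more recent versions of ggplot2 (3.3.0 or later, as far as we know), computed variables like the count can also be accessed by after_stat(), and the ..count.. notation is reported as deprecated in newer versions. A minimal sketch of the same bar chart in this notation (the name bar_as is only used for this illustration):

```r
library(ggplot2)

# Same counts as above, but using after_stat() 
# (assumes ggplot2 version 3.3.0 or later): 
bar_as <- ggplot(mpg) + 
  geom_bar(aes(x = class, y = after_stat(count)))
bar_as
```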

Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).

Proportion of cases

An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:

library(ggplot2)

## Data: 
ggplot2::mpg
#> # A tibble: 234 x 11
#>    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
#>    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
#>  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
#>  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
#>  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
#>  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
#>  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
#>  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
#>  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
#>  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
#>  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
#> 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
#> # ... with 224 more rows, and 1 more variable: class <chr>

# (1) Proportion of cases by class: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..prop.., group = 1))


# is the same as: 
ggplot(mpg) + 
  geom_bar(aes(x = class, y = ..count../sum(..count..)))
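Another way of showing proportions is to compute them explicitly before plotting. The following sketch assumes that the dplyr package is available and uses geom_col(), which draws bars for values that already exist in the data (the name class_prop is chosen for this illustration):

```r
library(ggplot2)
library(dplyr)

# Compute the proportion of cases by class: 
class_prop <- mpg %>%
  count(class) %>%
  mutate(prop = n / sum(n))
class_prop

# Plot the computed values as bars: 
ggplot(class_prop) + 
  geom_col(aes(x = class, y = prop))
```

Computing the values first makes it easy to check them (e.g., that all proportions add up to 1) before creating the plot.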

Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).

Bar plots of existing values

A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").

For instance, let’s plot a bar chart that shows the election data from the following tibble de:

year  party    share
2013  CDU/CSU  0.415
2013  SPD      0.257
2013  Others   0.328
2017  CDU/CSU  0.330
2017  SPD      0.205
2017  Others   0.465

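The code in this section uses the tibble de, but does not show how it was created. A minimal sketch for creating it from the values of the table above. The order of the factor levels of party is an assumption, chosen so that the fill colors used below ("black", "red3", "gold") match the parties in this order:

```r
library(tibble)

# Create the election data (values from the table above): 
de <- tibble(
  year  = rep(c("2013", "2017"), each = 3),                 # character
  party = factor(rep(c("CDU/CSU", "SPD", "Others"), times = 2),
                 levels = c("CDU/CSU", "SPD", "Others")),   # assumed order of levels
  share = c(.415, .257, .328, .330, .205, .465))
de
```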
  1. A version with 2 x 3 separate bars (using position = "dodge"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)

## (1) Bar chart with  side-by-side bars (dodge): ----- 

## (a) minimal version: 
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (A) 3 bars per election (position = "dodge"):  
  geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1


## (b) Version with text labels and customized colors: 
bp_1 + 
  ## pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01), 
            position = position_dodge(width = 1), 
            fontface = 2, color = "black") + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_bw()

  2. A version with 2 bars with 3 segments (using position = "stack"):
## Data: ----- 
de  # => 6 x 3 tibble
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## (2) Bar chart with stacked bars: -----  

## (a) minimal version: 
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
  ## (B) 1 bar per election (position = "stack"):
  geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2


## (b) Version with text labels and customized colors: 
bp_2 +   
  ## Pimping plot: 
  geom_text(aes(label = paste0(round(share * 100, 1), "%")), 
            position = position_stack(vjust = .5),
            color = rep(c("black", "white", "white"), 2), 
            fontface = 2) + 
  # Some set of high contrast colors: 
  scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) + 
  # Titles and labels: 
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") + 
  # coord_flip() + 
  theme_classic()

Bar plots with error bars

It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or confidence interval) to any bar plot. There is an entire range of geoms that draw error bars:

## Create data to plot: ----- 
n_cat <- 6
set.seed(101)

data <- tibble(
  name = LETTERS[1:n_cat],
  value = sample(seq(25, 50), n_cat),
  sd = abs(rnorm(n = n_cat, mean = 0, sd = 8)))  # abs(), as an SD cannot be negative
data
#> # A tibble: 6 x 3
#>   name  value    sd
#>   <chr> <int> <dbl>
#> 1 A        34 1.71 
#> 2 B        26 2.49 
#> 3 C        42 9.39 
#> 4 D        40 4.95 
#> 5 E        30 0.902
#> 6 F        31 7.34 

## Error bars: -----

## x-aesthetic only:

# (a) errorbar: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
    geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd), 
                  width = 0.4, color = "orange", alpha = 1, size = 1.0)


# (b) linerange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
    geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd), 
                   color = "firebrick", alpha = 1, size = 2.5)


## Additional y-aesthetic: 

# (c) crossbar:
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
    geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                  width = 0.3, color = "sienna1", alpha = 1, size = 1.0)


# (d) pointrange: 
ggplot(data) +
    geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
    geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd), 
                    color = "gold", alpha = 1.0, size = 1.2)
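The examples above use randomly generated values. With actual data, the means and measures of variability are typically computed first. A minimal sketch, assuming that the dplyr package is available and using the standard deviation of hwy in the mpg data (the names hwy_sum, mean_hwy, and sd_hwy are chosen for this illustration):

```r
library(ggplot2)
library(dplyr)

# Compute mean and standard deviation of hwy for each class of car: 
hwy_sum <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy),
            sd_hwy   = sd(hwy))
hwy_sum

# Bars show means, error bars show +/- 1 standard deviation: 
ggplot(hwy_sum, aes(x = class)) +
  geom_col(aes(y = mean_hwy), fill = "steelblue") +
  geom_errorbar(aes(ymin = mean_hwy - sd_hwy, ymax = mean_hwy + sd_hwy), 
                width = 0.4)
```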

Drawing curves and lines

ToDo:

  • adding trendlines
  • lines of data (e.g., means)
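Until this section is completed, the following sketch hints at how both items could be addressed. It assumes that a linear trendline from geom_smooth() and a line of group means from geom_line() are what is intended here (the name hwy_cyl is chosen for this illustration):

```r
library(ggplot2)
library(dplyr)

# (a) Adding a (linear) trendline to a scatterplot: 
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 1/4) +
  geom_smooth(method = "lm", formula = y ~ x)

# (b) Drawing a line of data (here: mean hwy by number of cylinders): 
hwy_cyl <- mpg %>%
  group_by(cyl) %>%
  summarise(mean_hwy = mean(hwy))

ggplot(hwy_cyl, aes(x = cyl, y = mean_hwy)) +
  geom_line() +
  geom_point(size = 3)
```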

Box plots

ToDo:

  • show medians, quartiles, distribution, and outliers
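Until this section is completed, a minimal sketch of a box plot, assuming that geom_boxplot() with its default settings is what is intended here (the name bx1 is only used for this illustration):

```r
library(ggplot2)

# Box plot of hwy by class of car: 
# Boxes show the median and quartiles, points show outliers. 
bx1 <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = hwy))
bx1
```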

Improving plots

Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:

  • colors: can be set within geoms (variable when inside aes(...), fixed outside) or by choosing or designing specific color scales;
  • labels: labs(...) allows setting titles, captions, axis labels, etc.;
  • legends: can be (re-)moved or edited;
  • themes: can be selected or modified.
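A short sketch that uses all 4 levers in one plot. The specific choices (palette, labels, and legend position) are merely examples, and the name p_imp is only used for this illustration:

```r
library(ggplot2)

p_imp <- ggplot(mpg) +
  # colors: variable (inside aes), other features fixed (outside aes): 
  geom_point(aes(x = displ, y = hwy, color = drv), size = 2, alpha = 1/2) +
  # colors: choosing a color scale (and naming its legend): 
  scale_color_brewer(name = "Drive:", palette = "Set1") +
  # labels: 
  labs(title = "Fuel economy by engine displacement", 
       x = "displ (in litres)", y = "hwy (miles per gallon)", 
       caption = "Data from ggplot2::mpg") +
  # themes: 
  theme_bw() +
  # legends: moving the legend below the plot: 
  theme(legend.position = "bottom")
p_imp
```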

[Last update on 2018-07-06 18:32:08 by hn.]


  1. This is different in Sankey diagrams, shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.